The goals of this exercise are to: - Create a multiple sequence alignment for phylogenetic analysis - Get comfortable writing and submitting your own job scripts - Collect sequences from a database for your own phylogenetic analysis
Our first step will be to analyze data for one of the three genes considered in this study:
Cartwright, P., Evans, N. M., Dunn, C. W., Marques, A. C., Miglietta, M. P., Schuchert, P., & Collins, A. G. (2008). Phylogenetics of Hydroidolina (Hydrozoa: Cnidaria). Journal of the Marine Biological Association of the UK, 88(08), 1663-1672. doi:10.1017/S0025315408002257
We will consider 18S, the small nuclear ribosomal RNA. This is not a protein coding gene, it is an RNA component of the ribosome.
The sequences are in the file 18s.raw.fasta, available
on canvas. Download it, then open it in a text editor like atom and
inspect the contents. Here are the first few lines:
>gi|15282083|gb|AF358068.1| Nectopyramis sp. AGC-2001 small subunit ribosomal RNA gene, partial sequence
GACAACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATA
AGCACTTGTACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATCGTTTATTTGATTGTACTTTACTAC
ATGGATACCTGTGGTAATTCTAGAGCTAATACATGCGAAAAATCCCAACTTCTGGAAGGGATGTATTTAT
...
The line that starts with the > is the header line,
this is the name of the sequence. These sequences were downloaded from
NCBI and have long complex
names that include several identifiers, the species name, and the gene
name. Following that are multiple lines of sequence data, and then the
next header and sequences etc…
These sequences are not aligned - they are assembled gene sequences for each species.
If we proceeded with the data as-is, the whole header would be used as the taxa names, which would make for a seriously ugly tree. So let’s clean up those headers. Rather than use a specialized tool, we will use a powerful form of search-and-replace called regular expressions to reformat them.
We will search for:
>gi\|([0-9]+)\|.+\| ([a-zA-Z]+) ([a-zA-Z]+)[\. ].+
and replace it with:
>\2_\3_\1
The end result is a fasta file where the sequences are untouched, but
the headers have the format >genus_species_gi, where the
gi is a unique identifier for each sequence. I leave that
in there since there can be multiple sequences from the same species. If
we were using just >genus_species, then there would be
multiple taxa with the same names which makes downstream tools choke.
Including the gi guarantees that every taxon name is
unique. It also helps keep track of where data came from.
There are many programs that implement regular expressions, below are descriptions for a couple. Pick whichever you like.
sed at the command lineMost unix like-systems, such as macOS or the Ubuntu you can install
on Windows with WSL, have a sed command for
Shorten the headers with the following command:
sed -E 's/>gi\|([0-9]+)\|.+\| ([a-zA-Z]+) ([a-zA-Z]+)[\. ].+/>\2_\3_\1/g' 18s.raw.fasta > 18s.fasta
This command has a few parts:
sed is the program we are using-E 's/>gi\|([0-9]+)\|.+\| ([a-zA-Z]+) ([a-zA-Z]+)[\. ].+/>\2_\3_\1/g'
defines the search and replace. Specifically,
>gi\|([0-9]+)\|.+\| ([a-zA-Z]+) ([a-zA-Z]+)[\. ].+ is
what we are searching for and >\2_\3_\1 is what we are
replacing it with.18s.raw.fasta is the input file name> 18s.fasta sends the output to a file called
18s.fastaCopy the file to a new file int he same directory, but without
.raw in the name. For example, 18s.fasta.
Open the new fasta file in VS Code. Click in the contents of the
fasta file to make sure it is active, and then select “Edit >
Replace”. Click on the .* button to enable regular
expressions.
In the Find box, enter:
>gi\|([0-9]+)\|.+\| ([a-zA-Z]+) ([a-zA-Z]+)[\. ].+
In the replace box, enter:
>$2_$3_$1
Notice that we use $ instead of \, as for
sed, for the replacing with captured text. That is a
difference in dialect between the two implementations of regular
expressions.
We will use the program mafft to align the sequences.
mafft 18s.fasta > 18s.aligned.fasta
You can install mafft and do this locally, or adapt a job script and do it remotely on a cluster.
TASK: Align the sequences locally or on a remote cluster with the above command. If running remotely, you will need to load the MAFFT module and schedule a job.
After alignment, you will have a new file
18s.aligned.fasta. Open it in VS Code. It is a fasta file,
with the same headers as 18s.fasta. But mafft
as inserted gaps, denoted by -, to line up homologous
nucleotides into the same columns. The gaps are needed because of
insertion and deletion events that happen in the course of evolution.
These are a distinct type of change from the nucleotide substitutions we
have modeled.
There are many tools that can be used to view multiple sequence
alignments, including mesquite. Rather than
install these, though, we will use an online viewer. Head over to MView. From the top
dropdown menu, select DNA. Click Chose File
and upload your 18s.aligned.fasta file. Then click
Submit. Within a few minutes, you will get a nice color
coded view of your alignment.
Problem 1: Answer the following questions about your alignment (1-2 sentences for each is fine).
Does it appear that the rate of evolution is similar at all sites? Explain your reasoning.
Answer: No, some sites are more variable than others, with more insertions/deletions per species, while others are pretty much the same for each species. This indicates that some sites have a higher rate of evolution than others.
Describe the distribution of gaps - are they uniformly distributed, or do they group together? Why do you think this is?
Answer: The gaps tend to form clusters, likely because some portions of DNA are more prone to insertions/deletions than others. This is likely due to differenes in the necessity of certain portions of DNA for the function of the gene. For example, the start and stop codons are likely more important than the middle of the gene, so the middle of the gene is more prone to insertions/deletions.
Problem 2: Create a job script to infer a phylogeny from this alignment. Download it to your computer, and insert code in the chunk below to view it.
Copy an iqtree job script from a previous exercise and
edit it to analyze the alignment you created. Or run iqtree locally. The
key change is to update the name of the input file, and put the output
files in this directory.
# Insert code here to view the phylogeny
phy = read.tree("18s.aligned.fasta.treefile")
plot(phy)
If the tree is too squished to view, adjust the width and height in the chunk options to give it more space.
I have provided all the input you need for analyses so far. Usually, though, you will need to gather the data yourself, be it by collecting the sequences from organisms you collect or downloading them from a public archive. Here you will download them from a public archive.
Ideally this would be a simple process - we would just specify the name of the clade we are interested in and the name of the gene we want to download for all members of that clade. But public archives are far messier places than one would hope.
The primary challenge you will find is that, while taxon name usage is fairly consistent, there is no uniform system for gene names. That means it really isn’t even worth searching sequences by gene names when building a phylogeny. If you can’t search by gene name, though, how are you to find the gene you want? The answer is to search for sequences themselves. Pick an example sequence for the gene you want, and use that as bait to fish out other sequences for that gene in your clade of interest.
Here we will stick with 18s. Specifically, let’s look at the very
first sequence in our 18s.raw.fasta file:
>gi|15282083|gb|AF358068.1| Nectopyramis sp. AGC-2001 small subunit ribosomal RNA gene, partial sequence
GACAACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATA
AGCACTTGTACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATCGTTTATTTGATTGTACTTTACTAC
ATGGATACCTGTGGTAATTCTAGAGCTAATACATGCGAAAAATCCCAACTTCTGGAAGGGATGTATTTAT
TAGATTAAAAACCAATGCGAATCTTCGGGTTCGTTTTCTTGGTGATTCATGATAACTTTTCGAATCGCAT
GGCCTAGCGCCGGCGATATTTCATTCAAATTTCTGCCCTATCAACTGTCGATGGTAAGGTATTGGCTTAC
CATGGTTATAACGGGTGACGGAGAATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACA
TCTAAGGAAGGCAGCAGGCTCGAAAATTACCCAATCCCGACTCGGGGAGGTAGTGACAAGAAATAACGAT
ACGGGGTCTTAATAGGTCTCGCAATTGGAATGAGTACAATTTAAATCCTTTAACGAGGATCAATTGGAGG
GCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAA
AGCTCGTAGTTGGATTTCGGAGTGGGCCAGTTGGTCCGCCGCAAGGTGTGTTACTGACTGGTTTGCTCTT
CTTCGCAAAGACTGCGTGTGCTCTTCGTTGAGTGTGCGTAGGATTTACGACGTTTACTTTGAAAAAATTA
GAGTGTTCAAAGCAGGCTATTGCTTGAATACATGAGCATGGAATAATGGAATAGGACTTTGGTCCCATTT
TGTTGGTTTCTAGGACCGAAGTAATGATTAAGAGGGACAATTGGGGGCATCCGTATTTCGTTGTCAGAGG
TGAAATTCTTGGATTTACGAAAGACGAACAACTGCGAAAGCACTTGCCAAGAGTGTTTTCATTAATCAAG
AACGAAAGTTAGAGGATCGAAGACGATCAGATACCGTCCTAGTTCTAACCATAAACGATGTCGACTAGGG
ATCAGCGGGCGTTATCGTACGACCCCGTTGGCACCTTACGGGAAACCAAAGTTTCTGGATTCCGGGGGAA
GTATAGTCGCAAGGCCGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGTGGCTCA
ATTTGACTCAACACGGGAAAACTTACCAGGTCCAGACATAGTAAGGATTGACAGGTTGAGAGCCCTTTCT
TGATTCTGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGGTTAATTCCGTTAAC
GAACGAGACCTTAACCGGCTAAATAGTCATGCGATTCTCGAATCGTAACTGACTTCTTAGAGGGACTGTT
GGTGTTTAACCAAAGTCAGGAAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGC
GCTACACTGTCGGATTCAGCGAGTCTTAACCTTAACCGAAAGGTTTGGGTAATCTTTTGAAAGTCCGACG
TGATGGGGATTGATCATTGCAATTATTGATCATGAACGAGGAATTCCTAGTAAATGCGAGTCATCAGCTC
GCGTTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGATTGAATGGTTTAGTGAGAT
CTTCGGATTGGCGCCGTCGCGGCCTCACGGAAGTGATGGTGCCGAAAAGTTGCTCAAACTTGATCATTTA
GAGGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCAGAAGGATCAAGCTTGGATCCCGGG
Now pick a clade that you want to infer a tree for. Head on over to
the NCBI
Taxonomy Browser. Click the Nucleotide box since we are
looking at DNA nucleotide sequences, and then click Go.
Then browse to or search for your clade. A few things to keep in
mind:
Despite these shortcomings, it is generally good enough for gathering together sequences.
We are going to go for a tree that has around 100 tips in it, so be sure that the number of nucleotide sequences is at least a few times larger than that (since not all the sequences are 18s, though it is one of the most commonly sequenced genes in studies that use PCR for enrichment).
Click on your clade, and then click on it again at the top of the page that loads. You are looking for a taxonomy ID. This number is a unique identifier for the clade.
Note the name and id here:
Clade name: Capitata
Taxonomy id: 576756
We will use an NCBI tool called blast to search for 18S sequences for your clade of interest. Head over to the blastn page, which is for searching for nucleotide sequences with a nucleotide query.
Enter the following:
Paste the bait sequence above, for Nectopyramis, into the box
lebeled
Enter accession number(s), gi(s), or FASTA sequence(s). You
can paste it with or without the header.
Paste the taxonid you got above for your clade into the
Organism box.
In the Optimize for section, select
Somewhat similar sequences (blastn).
Now click the Blast button at the bottom and wait for a result! It should take a few seconds to a few minutes.
If you get an error that your taxonid is invalid, then pick another (for example the parent group). Some taxaid are not available in blast filters - I have no idea why.
When your blast search finishes, click the link
MSA viewer. This will show you a multiple sequence
alignment for all the similar sequences it found in this clade. Instead
of showing each letter, it shows a colored line for each nucleotide that
is variable.
Note if any sequences are very poorly aligned or are super short
(less than about 50% of the alignment length). If they are, click the
Rows button, unselect the problematic sequences, and click
apply.
Once you have refined your sampling, click Download and
then FASTA Alignment. This will download a multiple
sequence alignment in fasta format with the .aln extension.
Rename the file alignment.raw.fasta, removing the
.aln extension.
Open the alignment in atom, and delete the first sequence (including the header). This first sequence is your query that you used as bait to get the other sequences.
Upload this alignment to your exercise folder.
The sequences are aligned, so we don’t need to run them through mafft. But they have crazy names still, which we need to shorten. Here is a sed command for doing that (it is a bit different from the one above, since the headers have a different format)
sed -E 's/>[a-zA-Z]+\|([a-zA-Z0-9\.]+)\|.+\[organism=([a-zA-Z0-9]+) ([a-zA-Z0-9]+).+?/>\2_\3_\1/g' alignment.raw.fasta > alignment.fasta
This will create a new file, alignment.fasta, with your
renamed headers.
Or use VS Code. Copy the alignment file to
alignment.fasta, and open it in VS Code. This is the Find
term:
>[a-zA-Z]+\|([a-zA-Z0-9\.]+)\|.+\[organism=([a-zA-Z0-9]+) ([a-zA-Z0-9]+).+?
and this is the Replace term:
>$2_$3_$1
Problem 3: Generating this file
alignment.fasta, and including it in your submission, satisfies problem 3.
Create a new job script called job_clade.sh that runs
iqtree on this alignment. Submit it to the queue. Once it finishes,
download the full exercise_05 folder and add this file to
it
Problem 4: Insert a code chunk below to show the tree you inferred for your clade.
# Insert code here to view the phylogeny
phy = read.tree("alignment.fasta.treefile")
plot(phy)
Knit this document to confirm that the trees show as they should in the output (again adjusting height and width as needed to get them to fit).
This is a very rough pass at building a phylogeny. For an actual analysis intended for publication, there would be a lot more to do, such as:
But it gives a general overview of some key steps.
Compress the exercise folder, including all input and output files, and submit it on canvas.